
Add guidance for CO HDF/NetCDF #121

Merged: 19 commits merged into staging on Dec 20, 2024

Conversation

@abarciauskas-bgse (Contributor) commented Nov 13, 2024

Adds long overdue and much requested guidance on cloud-optimizing HDF(5) and NetCDF(-4).

I've added @ajelenak and @ashiklom as co-authors, and also cited @bilts, @betolink, and @andypbarrett, so I'm tagging all of you for review.

github-actions bot commented Nov 14, 2024

PR Preview Action v1.4.8
🚀 Deployed preview to https://cloudnativegeo.github.io/cloud-optimized-geospatial-formats-guide/pr-preview/pr-121/
on branch gh-pages at 2024-12-20 21:21 UTC

@abarciauskas-bgse marked this pull request as ready for review November 15, 2024 00:07
@wildintellect (Contributor) commented:

@abarciauskas-bgse this is a great first version; a few questions and suggested fixes:

  • Fix: Compression is currently a subheading under Consolidated Metadata.
  • Q: When talking about optimum chunk size, is this the compressed size? Since compressed chunks are what get delivered, I would think you want to target compressed sizes.
  • Q: In Additional Research, Chuck's example was on a non-cloud-optimized HDF5; that's probably important to note.
  • Fix: "How to check chunk size and shape" is missing output and an explanation of how to read the output (see the sketch after this comment).
  • Q: Do we want to reference Zarr/kerchunking in some way as alternatives: Zarr for when cloud native is fine and you don't need a single "archival file", kerchunking (e.g. Kerchunk) for when you want an index around an existing file you don't want to or can't change?

TODO: We'll open a different ticket for a notebook page about writing files from Python etc., rather than always having to repack existing files.
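
For the "how to check chunk size and shape" item, this is roughly the kind of check and output I have in mind (a minimal sketch with h5py, assuming h5py 3.x; the file and dataset paths are made up). It also touches the compressed vs. uncompressed chunk size question:

```python
import h5py

# Sketch: inspect chunking for one dataset; file and dataset paths are hypothetical.
with h5py.File("granule.h5", "r") as f:
    dset = f["group/variable"]
    print("array shape:       ", dset.shape)
    print("chunk shape:       ", dset.chunks)      # None means contiguous (unchunked) storage
    print("compression filter:", dset.compression)
    if dset.chunks is not None:
        # Uncompressed chunk size = product of the chunk dimensions times the item size.
        nbytes = dset.dtype.itemsize
        for dim in dset.chunks:
            nbytes *= dim
        print("uncompressed chunk size (bytes):", nbytes)
        # Stored (compressed) size of the first written chunk on disk (h5py 3.x).
        print("stored size of first chunk (bytes):", dset.id.get_chunk_info(0).size)
```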

@wildintellect (Contributor) left a comment:

A couple of fixes #121 (comment)

@betolink commented:

This looks good! Just some minor additions that I'm not sure are relevant for a first pass on this.

  • As of now, the page size applies across the whole file. This means a user has to find a balanced page size that keeps metadata requests to a minimum while the unused space in the data pages does not increase the file size by a lot. We noticed this for IS2 ATL03; for example: file size 6GB, total metadata ~20MB. With 8MB pages the file size increases by roughly 1%, but for smaller files (e.g. <1GB) the 8MB page size increased the file size by ~10%, and this percentage varies depending on the ratio of page size to data chunk size. In short, a user should be careful when picking a page size, as it's dataset dependent.
  • The official HDF5 library needs to be configured to make use of page-aggregated files; I think @ajelenak said this will change in March when the HDF Group releases the next major version.
  • HDF5 doesn't have a geo spec for spatial chunking. At the lowest level, if a user needs to subset data (e.g. lat/lon subsetting), the HDF5 library has to load all the chunks of a dataset to create an index and use it to subset. To take CO-HDF5 to the next level, each chunk in the file should carry polygon/bbox info, indexed in a way that the drivers can understand. This is related to over-reads: e.g. our data chunks are ~1MB each and we use 8MB pages, so in a subsetting operation where we only need 2 chunks from contiguous pages... we will be requesting 16MB instead of 2MB. @ajelenak can confirm if this is true, and @bilts also mentioned it in his ESIP talk.
  • On creating vs repacking:

If I think of more stuff I'll add it later (I will be out next week). We also need to finish our tech report on IS2 and CO-HDF5; I think it should be ready for AGU.

@abarciauskas-bgse (Contributor, Author) commented:

@betolink @wildintellect thanks for the feedback. I have some AGU prep to do but once that is done I will address the comments.

@abarciauskas-bgse (Contributor, Author) commented Dec 16, 2024

@betolink thank you so much for these detailed comments. I have some comments and questions that I hope will help me sort out the details...

> As of now, the page size applies across the whole file. This means a user has to find a balanced page size that keeps metadata requests to a minimum while the unused space in the data pages does not increase the file size by a lot. We noticed this for IS2 ATL03; for example: file size 6GB, total metadata ~20MB. With 8MB pages the file size increases by roughly 1%, but for smaller files (e.g. <1GB) the 8MB page size increased the file size by ~10%, and this percentage varies depending on the ratio of page size to data chunk size. In short, a user should be careful when picking a page size, as it's dataset dependent.

I'm reading a bit more documentation and now I am confused, so I'm hoping to clarify. Using h5repack, it appears there is just one FS_PAGESIZE argument that can be set, to be used in combination with FS_STRATEGY=PAGE. But then I found in the HDF5 library documentation there are both H5Pset_small_data_block_size and H5Pset_meta_block_size, which leads me to believe you can set the metadata block size separately from the raw data block size (and more specifically, it's a block size for "small" data, so I'm assuming that just means cases where multiple small raw datasets can fit into one block). Do you know whether h5repack uses FS_PAGESIZE for both metadata and small raw data?
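
For concreteness, this is the kind of file creation I'm asking about, sketched with h5py rather than h5repack (assuming h5py 3.x, which exposes fs_strategy/fs_page_size; the 8 MiB page size, file name, and dataset are illustrative only, not recommendations):

```python
import h5py

# Sketch: create a file with the PAGE file space strategy and an example 8 MiB page size.
# Programmatic counterpart of repacking with FS_STRATEGY=PAGE and FS_PAGESIZE; the page
# size, file name, and dataset below are illustrative values, not recommendations.
with h5py.File("paged-example.h5", "w",
               fs_strategy="page",
               fs_page_size=8 * 1024 * 1024) as f:
    f.create_dataset(
        "example_variable",
        shape=(10_000, 1_000),
        chunks=(1_000, 1_000),
        compression="gzip",
    )
```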

Secondly, if I include this level of detail I think I should also clarify that pages are different from chunks. If I understand correctly, HDF5 will create "pages" of data using the page size, but the raw data itself could also be chunked, so presumably the chunks will always be smaller than the page sizes. Is this a correct understanding?

Also, are we concerned about increases in file size purely from a storage cost perspective? My understanding was that for performance, total file size doesn't matter as long as we can just grab reasonably sized chunks from the file.

> The official HDF5 library needs to be configured to make use of page-aggregated files; I think @ajelenak said this will change in March when the HDF Group releases the next major version.

I see in the HDF5 library there is H5Pset_page_buffer_size, and in h5py you can set page_buf_size. Are you saying these arguments already work for using page-aggregated files, or that this page buffer size setting is not enough on its own?
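
To make sure we're talking about the same thing, this is the sort of usage I mean: a sketch assuming h5py 3.x and a file already written with the PAGE strategy and an 8 MiB page size; the 32 MiB buffer, file name, and dataset name are illustrative only:

```python
import h5py

# Sketch: open a page-aggregated file for reading with a page buffer enabled.
# page_buf_size must be at least the file's page size (assumed 8 MiB here); the 32 MiB
# value, file name, and dataset name are illustrative only.
with h5py.File("paged-example.h5", "r",
               page_buf_size=32 * 1024 * 1024) as f:
    subset = f["example_variable"][:100, :100]
    print(subset.shape)
```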

> HDF5 doesn't have a geo spec for spatial chunking. At the lowest level, if a user needs to subset data (e.g. lat/lon subsetting), the HDF5 library has to load all the chunks of a dataset to create an index and use it to subset. To take CO-HDF5 to the next level, each chunk in the file should carry polygon/bbox info, indexed in a way that the drivers can understand. This is related to over-reads: e.g. our data chunks are ~1MB each and we use 8MB pages, so in a subsetting operation where we only need 2 chunks from contiguous pages... we will be requesting 16MB instead of 2MB. @ajelenak can confirm if this is true, and @bilts also mentioned it in his ESIP talk.

This is interesting, but I'm still trying to understand it, so I'm not sure how to include it in a way that is useful to readers. For the purposes of this first draft I will omit it, if that's ok with you.

> On creating vs repacking:

I'll add these in as notes.

> If I think of more stuff I'll add it later (I will be out next week). We also need to finish our tech report on IS2 and CO-HDF5; I think it should be ready for AGU.

Is the tech report out? If so I will link to it for sure.

@ashiklom commented:

Chiming in, since I'm actively working on applying this guidance to some next-generation data products from GMAO. Others, please correct me if I'm wrong!

My mental model of paged aggregation is that, when enabled, a page is basically the smallest unit of data that HDF5 can read or write; i.e., you can't read or write part of a page. All the consequences of inappropriately set page sizes flow from that.

> ...believe you can set the metadata block size separately from the raw data block size (and more specifically, it's a block size for "small" data, so I'm assuming that just means cases where multiple small raw datasets can fit into one block). Do you know whether h5repack uses FS_PAGESIZE for both metadata and small raw data?

I've never seen any HDF5 person mention different page sizes for metadata vs. (chunked) data. I think the two things you're linking to here refer to a different (not page-based) storage management strategy. But it would be awesome if HDF5 could have more flexible page sizes!

> If I understand correctly, HDF5 will create "pages" of data using the page size, but the raw data itself could also be chunked, so presumably the chunks will always be smaller than the page sizes. Is this a correct understanding?

My understanding is: Chunk sizes should be smaller than page sizes, but I don't think it's required; you can split chunks across multiple pages. Otherwise, HDF5's tiny default page size (4 KB?) would fail for most datasets.

> Also, are we concerned about increases in file size purely from a storage cost perspective? My understanding was that for performance, total file size doesn't matter as long as we can just grab reasonably sized chunks from the file.

My guess is that there might be a minor performance penalty for retrieving unused data (because you have to download/read more data than you actually need), but it'll be negligible in most cases. So yes, the primary concern with large page sizes is that they inflate overall file size (and therefore storage cost). But since lots of NASA data are big, that's a very important consideration! A 10% increase in NASA's ~140 PB catalog is ~14 PB, which is multiple big missions' worth of data!

@abarciauskas-bgse (Contributor, Author) commented:

@ashiklom thank you so much for chiming in! These thoughts are super helpful, and I'm interested to know how the GMAO product development goes.

> My mental model of paged aggregation is that, when enabled, a page is basically the smallest unit of data that HDF5 can read or write; i.e., you can't read or write part of a page. All the consequences of inappropriately set page sizes flow from that.

That is a helpful simplification, thank you.

> I've never seen any HDF5 person mention different page sizes for metadata vs. (chunked) data. I think the two things you're linking to here refer to a different (not page-based) storage management strategy. But it would be awesome if HDF5 could have more flexible page sizes!

I think you're right that these API methods are for a different file space management strategy.

> My understanding is: Chunk sizes should be smaller than page sizes, but I don't think it's required; you can split chunks across multiple pages. Otherwise, HDF5's tiny default page size (4 KB?) would fail for most datasets.

👍🏽

> My guess is that there might be a minor performance penalty for retrieving unused data (because you have to download/read more data than you actually need), but it'll be negligible in most cases. So yes, the primary concern with large page sizes is that they inflate overall file size (and therefore storage cost). But since lots of NASA data are big, that's a very important consideration! A 10% increase in NASA's ~140 PB catalog is ~14 PB, which is multiple big missions' worth of data!

👍🏽

I have incorporated most of these comments into a new box, "HDF5 File Space Management Strategies".

@betolink commented:

I concur with all the things @ashiklom said.

If chunk sizes are larger than page sizes they will be tracked separately, so page aggregation won't be applied to them, which is bad. I want to dive into a geo-spec for HDF5: how we can rechunk different collections and add geo-metadata to improve access even more, something I talked about with Aleksandar and Patrick.

The technical report on ATL03 is almost there; I think I'll use the holidays to finish it. I'm not sure about funding yet, but after talking to Brianna (NASA) I think the Cloud Native summit in April would be a great place to present it.

@abarciauskas-bgse (Contributor, Author) commented:

@betolink thank you for sharing the tech report, it looks great.

Just one question:

> If chunk sizes are larger than page sizes they will be tracked separately, so page aggregation won't be applied to them.

Do you mean that there will be metadata for both pages AND chunks? And why is this bad, besides an increase in metadata? Is it because of chunk over-reading when reading multiple pages? Sorry, this is the first time I'm hearing about this and I'm curious about how it works. In the technical report it says "Chunk sizes cannot be larger than the page size", which seems contradictory to what we are discussing here (that chunk sizes can be larger than page sizes, but it slows things down).

@abarciauskas-bgse (Contributor, Author) commented:

@wildintellect ok I have incorporated comments to date. I am happy to merge and publish and we can update with new feedback as it arrives.

@wildintellect self-requested a review December 19, 2024 16:53
@wildintellect (Contributor) left a comment:

I only have one minor question: is it bad to drop all references to alternatives to cloud optimization (i.e. services) like Hyrax, OPeNDAP, etc.? Should we be saying why we think cloud optimized is better, but that these do exist as alternatives?

@betolink commented:

> I only have one minor question: is it bad to drop all references to alternatives to cloud optimization (i.e. services) like Hyrax, OPeNDAP, etc.? Should we be saying why we think cloud optimized is better, but that these do exist as alternatives?

Good question. I see cloud-native formats as a better long-term solution than transformation services; although these services are needed in some cases, they too will benefit from the data being in cloud-optimized formats.

"Chunk sizes cannot be larger than the page size", which seems contradictory to what we are discussing here

They can be larger, but then the driver won't access those chunks using the single-page-size approach; they will be accessed as if they were in a regular HDF5 file, which is not bad if the chunk sizes are really large. We only ran into one case for ICESat-2: the page size was 8MB and a dataset had 10 MB chunks, so the smaller chunks were grouped into pages and the 10 MB chunks were not. Since they are big enough, the performance was not degraded. I think it was one of the 2 atmospheric datasets.

@abarciauskas-bgse (Contributor, Author) commented:

We actually do cover services generally on the home page:

> While it is possible to provide subsetting as a service, this requires ongoing maintenance of additional servers and extra network latency when accessing data (data has to go to the server where the subsetting service is running and then to the user). With cloud-optimized formats and the appropriate libraries, subsets of data can be accessed directly from an end user's machine without introducing an additional server.

But I think this sentence, which I just added to the introduction of the CO HDF5/NetCDF-4 page, strengthens the intro by providing a reason for cloud-optimizing:

> Cloud-optimized formats provide efficient subsetting without maintaining a service, such as OpenDAP, Hyrax or SlideRule.

@abarciauskas-bgse (Contributor, Author) commented:

Thanks @betolink for helping out here; I hope you don't mind me pursuing this question about chunk sizes and page sizes. My reasoning may be wrong or I may be missing a scenario, but I'm not sure I understand how having chunks larger than page sizes would degrade read performance. Here are some scenarios:

  1. Chunk size divides evenly into the page size. This seems good and fine because 1 or more chunks can be grouped into 1 page (as long as pages aren't too big relative to chunks, which could unnecessarily slow performance by reading 1 page that contains many chunks just to get a few of them).
  2. Chunk size fits within a page but does not divide evenly into the page size, say 8MB pages and 5MB chunks. Is each 5MB chunk stored in its own page, incurring lots of wasted space (very bad for file size)? Or are some chunks split across multiple pages, so that you have to load, say, 16MB of pages to read 10MB of chunk data (kinda bad)? If chunks are split across multiple pages, this seems inconsistent with the behavior in (3), where chunks larger than the page size are not split into pages.
  3. Chunk size is larger than page size. As you indicated above, these chunks are just not grouped into pages and performance does not degrade, since presumably the chunks are loaded in their entirety, similar to pages.

@abarciauskas-bgse (Contributor, Author) commented:

@wildintellect I'm going to go ahead and merge and I can incorporate any add'l feedback from @betolink and @ajelenak as it comes. Thank you for reviewing it!

@abarciauskas-bgse merged commit ed64546 into staging Dec 20, 2024
3 checks passed
@abarciauskas-bgse deleted the add-co-hdf5-guidance branch December 20, 2024 21:23